Run on RTX 4090 #38

Open: lapp0 wants to merge 4 commits into master
Conversation

lapp0 commented Nov 29, 2024

Changes to make the code run on RTX 4090 / 3090.

Fixes #29 ("A speedrun on consumer grade cards?")

Runs in 2 hours 3 minutes. Final losses across runs range from 3.275 to 3.285; this run finished at 3.2817.

These settings are intended to replicate the training dynamics of the 8xH100 run, not to be optimal. This is accomplished by halving the sequence length and doubling the batch size, which keeps the number of tokens per optimizer step constant, as sketched below. You can train in ~90 minutes by setting batch_size to 1.
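For illustration, a minimal sketch of the invariant (the numbers are hypothetical placeholders, not the actual configs):

    # Sketch of the "same training dynamics" idea: halving the sequence length
    # while doubling the batch size keeps tokens per optimizer step constant.
    # Hypothetical numbers, not the actual configs.
    ref = dict(batch_size=8, sequence_length=2048)   # reference settings
    new = dict(batch_size=16, sequence_length=1024)  # halved seq len, doubled batch

    assert (ref["batch_size"] * ref["sequence_length"]
            == new["batch_size"] * new["sequence_length"])  # same token budget per step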

#38 (comment)

KellerJordan (Owner) commented Nov 30, 2024

Thanks for your work. I will have to think about this and understand the options before merging.

One thing I would merge immediately is a reproducible log generated by this. It could be put in the records folder and linked from the README, to help people who want to run on a 4090. I might prefer that, since it would avoid adding options to the code.

lapp0 (Author) commented Nov 30, 2024

Perhaps a reasonable alternative is to:

    1. Expose hyperparameters as command-line arguments via argparse
    2. Document the command which enables 4090 training runs

vgoklani commented Dec 2, 2024

@lapp0 Hey, what does this do? Thanks!

        flex_kernel_options={
            "BLOCK_M": 64, "BLOCK_N": 64,  # forward
            "BLOCK_M1": 32, "BLOCK_N1": 64, "BLOCK_M2": 64, "BLOCK_N2": 32  # backwards
        }

Presumably you are passing values to the Triton kernels, but how did you come up with those values? Thanks!

lapp0 (Author) commented Dec 2, 2024

@vgoklani I stole the values from pytorch/pytorch#133254 (comment)
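For context: these are Triton tile sizes for the kernels that FlexAttention generates; smaller tiles reduce the kernels' shared-memory footprint, which matters on 4090/3090 cards that have less shared memory per SM than an H100. A minimal sketch of how such options are passed, assuming the stock flex_attention API (the tensor shapes are made up):

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    # Dummy inputs with shape (batch, heads, seq_len, head_dim).
    q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

    # BLOCK_M/BLOCK_N tile the forward kernel; the *1/*2 variants tile the two
    # backward kernels. Smaller tiles trade some throughput for lower
    # shared-memory usage.
    out = flex_attention(q, k, v, kernel_options={
        "BLOCK_M": 64, "BLOCK_N": 64,
        "BLOCK_M1": 32, "BLOCK_N1": 64, "BLOCK_M2": 64, "BLOCK_N2": 32,
    })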

banyan-god commented:
    Perhaps a reasonable alternative is to:
    1. Expose hyperparameters as command-line arguments via argparse
    2. Document the command which enables 4090 training runs

I definitely prefer this over how it is implemented now, especially if you want to use multiple 4090s and, say, keep the context length the same but change the batch size instead.

lapp0 (Author) commented Dec 4, 2024

Updated code and docs. You can now specify GPTConfig and Hyperparameters fields as command-line arguments, e.g.

torchrun --standalone --nproc_per_node=4 train_gpt2.py \
    --gpt.flex_kernel_consumer True --train.sequence_length 32768 --train.batch_size 16

Rendered Docs

Unrelated: added n_intermediate to GPTConfig, which defaults to n_embd * 4.
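A minimal sketch of how dotted-prefix overrides like these could be wired up with argparse (a hypothetical implementation with placeholder defaults, not necessarily what this PR does):

    import argparse
    from dataclasses import dataclass, fields

    @dataclass
    class GPTConfig:
        flex_kernel_consumer: bool = False
        n_embd: int = 768  # placeholder default

    @dataclass
    class Hyperparameters:
        sequence_length: int = 65536  # placeholder default
        batch_size: int = 8           # placeholder default

    def parse_bool(s: str) -> bool:
        # bool("False") is True, so parse booleans explicitly.
        return s.lower() in ("1", "true", "yes")

    def add_dataclass_args(parser, prefix, cls):
        # Expose every dataclass field as --<prefix>.<name>, typed from its default.
        for f in fields(cls):
            typ = parse_bool if isinstance(f.default, bool) else type(f.default)
            parser.add_argument(f"--{prefix}.{f.name}", type=typ, default=f.default)

    parser = argparse.ArgumentParser()
    add_dataclass_args(parser, "gpt", GPTConfig)
    add_dataclass_args(parser, "train", Hyperparameters)
    ns = vars(parser.parse_args())  # dotted dests require vars(), not attribute access

    gpt_config = GPTConfig(**{f.name: ns[f"gpt.{f.name}"] for f in fields(GPTConfig)})
    train_args = Hyperparameters(**{f.name: ns[f"train.{f.name}"] for f in fields(Hyperparameters)})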

KellerJordan (Owner) commented:
I linked to this from the README. I don't think I'll merge it because I don't want to add any arguments to train_gpt2.py.

lapp0 (Author) commented Dec 5, 2024

Perhaps we could have train_gpt2.py be

def train(model_config: GPTConfig, args: Hyperparameters):
    ...
    
if __name__ == "__main__":
    train(GPTConfig(), Hyperparameters())

This is equivalent to the current code in master, except that it puts the trainer in a function parameterized by the model config and training arguments, which cleans up the code and makes it easier to use hyperparameter-search tools.

Then we can create train_gpt2_cli.py, which imports train from train_gpt2.py and handles argument parsing, as sketched below.
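A sketch of what that split could look like (hypothetical file; the argument parsing itself is elided, see the earlier sketch):

    # train_gpt2_cli.py (hypothetical companion script)
    from train_gpt2 import GPTConfig, Hyperparameters, train

    def main():
        # Parse CLI overrides into the two configs (e.g. with argparse, as
        # sketched earlier), then hand them to the trainer.
        model_config = GPTConfig()
        args = Hyperparameters()
        train(model_config, args)

    if __name__ == "__main__":
        main()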

vak commented Dec 13, 2024

@lapp0 many thanks for your fork!

I've successfully run your fork on a single RTX 4090 with minor version upgrades and a tiny fix to log-output writing:

➜  modded-nanogpt git:(pr-38) ✗ git diff
diff --git a/Dockerfile b/Dockerfile
index 4ca8600..57fcced 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,7 +1,7 @@
-FROM nvidia/cuda:12.6.2-cudnn-devel-ubuntu24.04
+FROM nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04
 
 ENV DEBIAN_FRONTEND=noninteractive
-ENV PYTHON_VERSION=3.12.7
+ENV PYTHON_VERSION=3.12.8
 ENV PATH=/usr/local/bin:$PATH
 
 RUN apt update && apt install -y --no-install-recommends build-essential libssl-dev zlib1g-dev \
@@ -27,7 +27,7 @@ WORKDIR /modded-nanogpt
 RUN python -m pip install --upgrade pip && \
     pip install -r requirements.txt
 
-RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124 --upgrade
+RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade
 
 CMD ["bash"]
 ENTRYPOINT []
diff --git a/run.sh b/run.sh
index 7bb09f4..04e486d 100755
--- a/run.sh
+++ b/run.sh
@@ -1 +1 @@
-torchrun --standalone --nproc_per_node=8 train_gpt2.py
+torchrun --standalone --nproc_per_node=1 train_gpt2.py
diff --git a/train_gpt2.py b/train_gpt2.py
index 6f7f74e..35b1d87 100644
--- a/train_gpt2.py
+++ b/train_gpt2.py
@@ -437,7 +437,7 @@ def train(model_config: GPTConfig, args: Hyperparameters):
             with open(logfile, "a") as f:
                 if not logonly:
                     print(s)
-                f.write(s+'\n')
+                f.write(str(s)+'\n')
     # log information about the hardware/software environment this is running on
     # and print the full `nvidia-smi` to file
     print0(f"Running pytorch {torch.version.__version__} compiled for CUDA {torch.version.cuda}\nnvidia-smi:")

However, I am running out of GPU memory, probably because I am using a single-GPU setup?

If you have an idea which parameters to tweak to fit in a single RTX 4090 (24 GB), please let me know. Thank you!

lapp0 closed this Dec 16, 2024
lapp0 reopened this Dec 16, 2024
lapp0 (Author) commented Dec 16, 2024

@vak you can always decrease the sequence length further and increase the batch size in the same ratio; see the example below.
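For example, following the command format above (hypothetical values that keep sequence_length * batch_size constant at 16384 * 32 = 32768 * 16; actual memory use will also depend on the number of GPUs):

    torchrun --standalone --nproc_per_node=1 train_gpt2.py \
        --gpt.flex_kernel_consumer True --train.sequence_length 16384 --train.batch_size 32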
